Retrieving Time from Scanned Books
Authors
Abstract
While millions of scanned books have become available in recent years, this vast collection of data remains under-utilized. Book search is often limited to summaries or metadata, and connecting information to primary sources can be a challenge. Even though digital books provide rich historical information on all subjects, leveraging this data is difficult. To explore how we can access this historical information, we study the problem of identifying relevant times for a given query. That is, given a user query or a description of an event, we attempt to use historical sources to locate that event in time. We use state-of-the-art NLP tools to identify and extract mentions of times present in our corpus, and then propose a number of models for organizing this historical information. Since no ground-truth data is readily available for our task, we automatically derive dated event descriptions from Wikipedia, leveraging both the wisdom of the crowd and the wisdom of experts. Using 15,000 events from between the years 1000 and 1925 as queries, we evaluate our approach on a collection of 50,000 books from the Internet Archive. We discuss the tradeoffs between context, retrieval performance, and efficiency.
Similar resources
A Device for automated Scanning of Books
Fast access to older theses and dissertations is still difficult, because they are often available only in printed form. In digital form, the content of these books could be accessed more easily. However, scanning bound books is a time-consuming and costly process because the pages of the book must be turned manually. At the Institute for Print and Media Technology, a device was developed, which c...
Enhancing Readability of Scanned Picture Books
We describe a system that enhances the readability of scanned picture books. Motivated by our website of children’s books in the International Children's Digital Library, the system separates textual from visual content which decreases the size of the image files (since their quality can be lower) while increasing the quality of the text by displaying it as computer-generated text inst...
Speech balloon contour classification in comics
Comic book digitization, combined with subsequent comic book understanding, creates a variety of new applications, including mobile reading and data mining. Document understanding in this domain is challenging as comics are semi-structured documents, combining semantically important graphical and textual parts. In this work we detail a novel approach for classifying speech balloon in scanned comi...
A Novel Technique for ECG Morphology Interpretation and Arrhythmia Detection Based on Time Series Signal Extracted from Scanned ECG Record
Cardiovascular disease (CVD) is one of the biggest health problems in India and around the world. The electrocardiogram has been a traditional method for diagnosing heart disease for about a century. Maintaining and retrieving patient history during a course of treatment is an essential but laborious process. More particularly, over a decade ago thermal ECG records were stored ph...
A Metadata Generation System for Scanned Scientific Volumes
Large scale digitization projects have been conducted at digital libraries to preserve cultural artifacts and to provide permanent access. The increasing amount of digitized resources, including scanned books and scientific publications, requires development of tools and methods that will efficiently analyze and manage large collections of digitized resources. In this work, we tackle the proble...
Publication date: 2015